Abstract

Group-based sparsity models have proven instrumental in linear regression problems for recovering signals 
from far fewer measurements than standard compressive sensing requires. The main promise of these models is the 
recovery of "interpretable" signals along with the identification of their constituent groups. To this end, we establish 
a combinatorial framework for group-model selection problems and highlight the underlying tractability issues 
revolving around such notions of interpretability when the regression matrix is simply the identity operator. We 
show that, in general, claims of correctly identifying the groups with convex relaxations would lead to polynomial 
time solution algorithms for a well-known NP-hard problem, called the weighted maximum cover problem. Instead, 
leveraging a graph-based understanding of group models, we describe group structures which enable correct model 
identification in polynomial time via dynamic programming. We also show that group structures that lead to totally 
unimodular constraints have tractable discrete as well as convex relaxations. Finally, we study the Pareto frontier of 
budgeted group-sparse approximations for the tree-based sparsity model and illustrate identification and computation 
trade-offs between our framework and the existing convex relaxations. 

Index Terms 

Signal Approximation, Structured Sparsity, Interpretability, Tractability, Dynamic Programming, Compressive 
Sensing. 



I. Introduction 

INFORMATION in many natural and man-made signals can be exactly represented or well approximated 
by a sparse set of nonzero coefficients in an appropriate basis pi. Compressive sensing (CS) exploits this 
fact to recover signals from their compressive samples, which are dimensionality reducing, non-adaptive random 
measurements. According to the CS theory, the number of measurements for stable recovery is proportional to 
the signal sparsity, rather than to its Fourier bandwidth as dictated by the Shannon/Nyquist theorem [3]-[5]. 
Unsurprisingly, the utility of sparse representations also goes well-beyond CS and permeates a lot of fundamental 
problems in signal processing, machine learning, and theoretical computer science. 

Recent results in CS extend the simple sparsity idea to more sophisticated structured sparsity models, 
which describe the interdependency between the nonzero coefficients [1], [6]-[8]. There are several compelling 
reasons for such extensions: structured sparsity models can significantly reduce the number of measurements 
required for perfect recovery in the noiseless case and are more stable in the presence of noise. Furthermore, 
they facilitate the interpretation of the signals in terms of the chosen structures, revealing information that could 
be used to better understand their properties. 

An important class of structured sparsity models is based on groups of variables that should either be selected 
or discarded together [8]-[12]. These structures naturally arise in applications such as neuroimaging [13], [14], gene 
expression data [11], [15], bioinformatics [16], [17] and computer vision [1], [18]. For example, in cancer research, 
the groups might represent genetic pathways that constitute cellular processes. Identifying which processes lead to 
the development of a tumor can allow biologists to directly target certain groups of genes instead of others [15]. 
Incorrect identification of the active/inactive groups can thus have a rather dramatic effect on the speed at which 
cancer therapies are developed. 

As a result, in this paper, we consider group-based sparsity models, denoted as $\mathfrak{G}$. These structured sparsity 
models feature collections of groups of variables that could overlap arbitrarily, that is, $\mathfrak{G} = \{\mathcal{G}_1, \ldots, \mathcal{G}_M\}$, where 
each $\mathcal{G}_j$ is a subset of the index set $\{1, \ldots, N\}$, with $N$ being the dimensionality of the signal that we model. 
Arbitrary overlaps mean that we do not restrict the intersection between any two sets $\mathcal{G}_j$ and $\mathcal{G}_k$ from $\mathfrak{G}$. 

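In this paper, we address the signal approximation problem based on a known group structure $\mathfrak{G}$. That is, 
given a signal $\mathbf{x} \in \mathbb{R}^N$, we seek an $\hat{\mathbf{x}}$ closest to it in the Euclidean sense, whose support (i.e., the index set of 
its non-zero coefficients) consists of the union of at most $G$ groups from $\mathfrak{G}$, where $G > 0$ is a user-defined group 
budget: 

$$\hat{\mathbf{x}} \in \operatorname*{argmin}_{\mathbf{z} \in \mathbb{R}^N} \Big\{ \|\mathbf{x} - \mathbf{z}\|_2^2 : \mathrm{supp}(\mathbf{z}) \subseteq \bigcup_{\mathcal{G} \in \mathcal{S}} \mathcal{G},\ \mathcal{S} \subseteq \mathfrak{G},\ |\mathcal{S}| \le G \Big\},$$

where $\mathrm{supp}(\mathbf{z})$ is the support of the vector $\mathbf{z}$. We call such an approximation $G$-group-sparse or, in short, 
group-sparse. 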

More importantly, we seek to identify the $G$-group-support of the approximation $\hat{\mathbf{x}}$, that is, the $G$ groups that 
constitute its support. We call this the group-sparse model selection problem. The $G$-group-support of $\hat{\mathbf{x}}$ allows us 
to "interpret" the original signal and discover its properties so that we can, for example, target specific groups of 
genes instead of others [15] or focus more precise imaging techniques on certain brain regions only [19]. As a 
result, we study under which circumstances we can correctly and tractably identify the group-support of a given 
signal. 

Previous work. Recent works in compressive sensing and machine learning with group sparsity have mainly 
focused on leveraging the group structures for lowering the number of samples required for recovering signals [1], 
[11], [20]-[22]. While these results have established the importance of group structures, many of these 
works have not fully addressed the relevant issue of model selection. 

For the special case of non-overlapping groups, dubbed the block-sparsity model, the problem of model 
selection does not present computational difficulties and features a well-understood theory [20]. The first convex 
relaxations for group-sparse approximation [23] considered only non-overlapping groups. Their extension to overlapping 
groups [24] has the drawback of selecting supports defined as the complement of a union of groups (see also 
[10]). 

For overlapping groups, on the other hand, Eldar et al. [6] consider the union of subspaces framework and 
cast the model selection problem as a block-sparse model selection one by duplicating the variables that belong to 
overlaps between the groups. Their uniqueness condition [6, Prop. 1], however, is infeasible for any group structure 
with overlaps, because it requires that the subspaces intersect only at the origin, while two subspaces defined by 
two overlapping groups of variables intersect on a subspace of dimension equal to the number of elements in the 
overlap. 



The recently proposed convex relaxations [11], [22] for group-sparse approximations select group-supports that 
consist of unions of groups. However, the group-support recovery conditions in [11], [22] should be taken with care, 
because they are defined with respect to a particular subset of group-supports and are not general. As we numerically 
demonstrate in this paper, the group-supports recovered with these methods might be incorrect. Furthermore, the 
required consistency conditions in [11], [22] are unverifiable a priori. For instance, they require tuning parameters 
to be known beforehand to obtain the correct group-support. 

Contributions. This paper is an extended version of a prior submission to the IEEE International Symposium 
on Information Theory (ISIT), 2013. This version contains all the proofs that were previously omitted due to lack of 
space, refined explanations of the concepts, and provides the full description of the proposed dynamic programming 
algorithms. 

In stark contrast to the existing literature, we take an explicitly discrete approach to identifying group-supports 
of signals given a budget constraint on the number of groups. This fresh perspective enables us to show that the 
group-sparse model selection problem is NP-hard: if we can solve the group model selection problem in general, 
then we can solve any weighted maximum coverage (WMC) problem instance in polynomial time. However, WMC 
is known to be NP-Hard. Given this connection, we can only hope to characterize a subset of instances which are 
tractable or find guaranteed and tractable approximations. 

We then present characterizations of group structures that lead to computationally tractable problems via dynamic 
programming. We do so by leveraging a graph-based representation of the groups and exploiting properties of the 
induced graph. We present and describe a novel dynamic program that solves the WMC problem for a specific 
class of group structures and could be of interest by itself. We also identify tractable discrete relaxations of the 
group-sparse model selection problem that lead to efficient algorithms. Specifically, we relax the constraint on the 
number of groups into a penalty term and show that if the remaining group constraints satisfy a property related 
to the concept of total unimodularity [25], then the relaxed problem can be efficiently solved using linear program 
solvers. We also extend the discrete model to incorporate an overall sparsity constraint and to allow the selection of 
individual elements from each group, leading to within-group sparsity. Furthermore, we discuss how this extension 
can be used to model hierarchical relationships between variables. We present a novel dynamic program that solves 
the hierarchical model selection problem exactly and discuss a tractable discrete relaxation. 

We also interpret the implications of our results in the context of other group-based recovery frameworks. For 
instance, the convex approaches proposed in [6], [11], [22] also relax the discrete constraint on the cardinality of 
the group support. However, they first need to decompose the approximation into vector components whose support 
consists only of one group and then penalize the norms of these components. It has been observed [11] that these 
relaxations produce approximations that are group-sparse, but their group-support might include irrelevant groups. 
We concretely illustrate these cases via a Pareto frontier example. 

Paper structure. The paper is organized as follows. In Section II, we present definitions of group-sparsity and 
related concepts, while in Section III we formally define the approximation and model-selection problems and 
connect them to the WMC problem. We present and analyze discrete relaxations of the WMC in Section IV and 
consider convex relaxations in Section V. In Section VI, we illustrate via a simple example the differences between 
the original problem and the relaxations. The generalized model is introduced and analyzed in Section VII, while 
numerical simulations are presented in Section VIII. We conclude the paper with some remarks in Section IX. The 
appendices contain the detailed descriptions of the dynamic programs. 



II. Basic Definitions 

Let $\mathbf{x} \in \mathbb{R}^N$ be a vector and $\mathcal{N} = \{1, \ldots, N\}$ be the ground set of its indices. We use $|\mathcal{S}|$ to denote the 
cardinality of an index set $\mathcal{S}$. We use $\mathbb{B}^N$ to represent the space of $N$-dimensional binary vectors and define 
$\iota : \mathbb{R}^N \to \mathbb{B}^N$ to be the indicator function of the nonzero components of a vector in $\mathbb{R}^N$, i.e., $\iota(\mathbf{x})_i = 1$ if $x_i \neq 0$ 
and $\iota(\mathbf{x})_i = 0$ otherwise. We let $\mathbf{1}_N$ be the $N$-dimensional vector of all ones and $\mathbf{I}_N$ the $N \times N$ identity matrix. 
The support of $\mathbf{x}$ is defined by the set-valued function $\mathrm{supp}(\mathbf{x}) = \{i \in \mathcal{N} : x_i \neq 0\}$. Note that we use bold 
lowercase letters to indicate vectors and bold uppercase letters to indicate matrices. 

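Definition II.1. A group structure $\mathfrak{G} = \{\mathcal{G}_1, \ldots, \mathcal{G}_M\}$ is a collection of index sets, named groups, with $\mathcal{G}_j \subseteq \mathcal{N}$ 
and $|\mathcal{G}_j| = g_j$ for $1 \le j \le M$ and $\bigcup_{\mathcal{G} \in \mathfrak{G}} \mathcal{G} = \mathcal{N}$. 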

We can represent a group structure $\mathfrak{G}$ as a bipartite graph, where on one side we have the $N$ variable nodes 
and on the other the $M$ group nodes. An edge connects a variable node $i$ to a group node $j$ if $i \in \mathcal{G}_j$. Fig. 1 shows 
an example. The bi-adjacency matrix $\mathbf{A}^{\mathfrak{G}} \in \mathbb{B}^{N \times M}$ of the bipartite graph encodes the group structure, 

$$\mathbf{A}^{\mathfrak{G}}_{ij} = \begin{cases} 1, & \text{if } i \in \mathcal{G}_j; \\ 0, & \text{otherwise.} \end{cases}$$

Another useful representation of a group structure is via a group graph $(V, E)$ where the nodes $V$ are the groups 
$\mathcal{G} \in \mathfrak{G}$ and the edge set $E$ contains $e_{ij}$ if $\mathcal{G}_i \cap \mathcal{G}_j \neq \emptyset$, that is, an edge connects two groups that overlap. A sequence 
of connected nodes $v_1, v_2, \ldots, v_n$ is a loop if $v_1 = v_n$. 

In order to illustrate these concepts, consider the group structure $\mathfrak{G}^1$ defined by the following groups: $\mathcal{G}_1 = \{1\}$, 
$\mathcal{G}_2 = \{2\}$, $\mathcal{G}_3 = \{1,2,3,4,5\}$, $\mathcal{G}_4 = \{4,6\}$, $\mathcal{G}_5 = \{3,5,7\}$ and $\mathcal{G}_6 = \{6,7,8\}$. $\mathfrak{G}^1$ can be represented by the 
variables-groups bipartite graph of Fig. 1 or by the group graph of Fig. 2, which is bipartite and contains loops. 
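
The two representations above can be made concrete with a short sketch. The code below is only an illustration in plain Python: the helper names `biadjacency`, `group_graph` and `has_loop` are hypothetical and not part of the paper. It builds the bi-adjacency matrix and the group graph of the example structure and checks for a loop with a union-find pass.

```python
from itertools import combinations

# Groups of the example structure (1-based variable indices, as in the text).
groups = {
    1: {1},
    2: {2},
    3: {1, 2, 3, 4, 5},
    4: {4, 6},
    5: {3, 5, 7},
    6: {6, 7, 8},
}
N = 8


def biadjacency(groups, n_vars):
    """A[i][j] = 1 if variable i+1 belongs to the (j+1)-th group, else 0."""
    order = sorted(groups)
    return [[1 if i in groups[j] else 0 for j in order]
            for i in range(1, n_vars + 1)]


def group_graph(groups):
    """Edges between overlapping groups, labelled by their intersection."""
    return {(a, b): groups[a] & groups[b]
            for a, b in combinations(sorted(groups), 2)
            if groups[a] & groups[b]}


def has_loop(groups):
    """Detect a cycle in the group graph with a union-find structure."""
    parent = {g: g for g in groups}

    def find(g):
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for a, b in group_graph(groups):
        ra, rb = find(a), find(b)
        if ra == rb:          # a and b are already connected: this edge closes a loop
            return True
        parent[ra] = rb
    return False


A = biadjacency(groups, N)
edges = group_graph(groups)
# Pairwise overlapping: every variable (row of A) belongs to at most two groups.
pairwise = all(sum(row) <= 2 for row in A)
```

For this structure, `pairwise` is true and `has_loop(groups)` reports a loop, which is the situation the text describes for the example.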

An important group structure is given by loopless pairwise overlapping groups. This group structure consists of 
groups such that each element of the ground set occurs in at most two groups and the induced graph does not contain 
loops. Therefore the group graph for these structures is actually a tree or a forest and hence bipartite. For example, 
consider $\mathcal{G}_1 = \{1,2,3\}$, $\mathcal{G}_2 = \{3,4,5\}$, $\mathcal{G}_3 = \{5,6,7\}$, which can be represented by the graph in Fig. 3 (Left). If 
$\mathcal{G}_3$ were to include an element from $\mathcal{G}_1$, for example $\{2\}$, we would have the loopy graph of Fig. 3 (Right). Note 
that $\mathfrak{G}^1$ is pairwise overlapping, but not loopless, since $\mathcal{G}_3$, $\mathcal{G}_4$, $\mathcal{G}_5$ and $\mathcal{G}_6$ form a loop. 

Fig. 1. Example of bipartite graph between variables and groups induced by the group structure $\mathfrak{G}^1$, see text for details. 

Fig. 2. Bipartite group graph with loops induced by the group structure $\mathfrak{G}^1$, where on each edge we report the elements of the intersection. 

We anchor our analysis of the tractability of interpretability via selection of groups on covering arguments. 

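Definition II.2. A group cover $\mathcal{S}(\mathbf{x})$ for a signal $\mathbf{x} \in \mathbb{R}^N$ is a collection of groups such that $\mathrm{supp}(\mathbf{x}) \subseteq \bigcup_{\mathcal{G} \in \mathcal{S}(\mathbf{x})} \mathcal{G}$. 
An alternative equivalent definition is given by 

$$\mathcal{S}(\mathbf{x}) = \big\{ \mathcal{G}_j \in \mathfrak{G} : \boldsymbol{\omega} \in \mathbb{B}^M,\ \omega_j = 1,\ \mathbf{A}^{\mathfrak{G}} \boldsymbol{\omega} \ge \iota(\mathbf{x}) \big\}.$$

The binary vector $\boldsymbol{\omega}$ indicates which groups are active and the constraint $\mathbf{A}^{\mathfrak{G}} \boldsymbol{\omega} \ge \iota(\mathbf{x})$ makes sure that, for 
every non-zero component of $\mathbf{x}$, there is at least one active group that covers it. We also say that $\mathcal{S}(\mathbf{x})$ covers $\mathbf{x}$. 
Note that the group cover is often not unique and $\mathcal{S}(\mathbf{x}) = \mathfrak{G}$ is a group cover for any signal $\mathbf{x}$. This observation 
leads us to consider more restrictive definitions of group cover. 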

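Definition II.3. A $G$-group cover $\mathcal{S}^G(\mathbf{x}) \subseteq \mathfrak{G}$ is a group cover for $\mathbf{x}$ with at most $G$ elements, 

$$\mathcal{S}^G(\mathbf{x}) = \Big\{ \mathcal{G}_j \in \mathfrak{G} : \boldsymbol{\omega} \in \mathbb{B}^M,\ \omega_j = 1,\ \mathbf{A}^{\mathfrak{G}} \boldsymbol{\omega} \ge \iota(\mathbf{x}),\ \sum_{j=1}^M \omega_j \le G \Big\}.$$

It is not guaranteed that a $G$-group cover always exists for any value of $G$. Finding the smallest $G$-group cover 
leads to the following definitions. 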



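Definition II.4. The group $\ell_0$-"norm" is defined as 

$$\|\mathbf{x}\|_{\mathfrak{G},0} := \min_{\boldsymbol{\omega} \in \mathbb{B}^M} \Big\{ \sum_{j=1}^M \omega_j : \mathbf{A}^{\mathfrak{G}} \boldsymbol{\omega} \ge \iota(\mathbf{x}) \Big\}. \tag{1}$$

Fig. 3. (Left) Loopless pairwise overlapping groups. (Right) By adding one element from $\mathcal{G}_1$ into $\mathcal{G}_3$, we introduce a loop in the graph. 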
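
Definition II.5. A minimal group cover for a signal $\mathbf{x} \in \mathbb{R}^N$ is defined as $\mathcal{M}(\mathbf{x}) = \{\mathcal{G}_j \in \mathfrak{G} : \hat{\omega}(\mathbf{x})_j = 1\}$, where 
$\hat{\boldsymbol{\omega}}$ is a minimizer for (1), 

$$\hat{\boldsymbol{\omega}}(\mathbf{x}) \in \operatorname*{argmin}_{\boldsymbol{\omega} \in \mathbb{B}^M} \Big\{ \sum_{j=1}^M \omega_j : \mathbf{A}^{\mathfrak{G}} \boldsymbol{\omega} \ge \iota(\mathbf{x}) \Big\}.$$

A minimal group cover $\mathcal{M}(\mathbf{x})$ is a group cover for $\mathbf{x}$ with minimal cardinality. Note that there exist pathological 
cases where, for the same group $\ell_0$-"norm", we have different minimal group cover models. 

Definition II.6. A signal $\mathbf{x}$ is $G$-group sparse with respect to a group structure $\mathfrak{G}$ if $\|\mathbf{x}\|_{\mathfrak{G},0} \le G$. 

In other words, a signal is $G$-group sparse if its support is contained in the union of at most $G$ groups from $\mathfrak{G}$. 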
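
To make the definitions of group cover and group $\ell_0$-"norm" concrete, the following sketch enumerates covers by brute force. It is an illustration only: the function name `group_l0_norm` and the test signal are assumptions, and the exhaustive search is exponential in the number of groups, so it is meant for toy examples. It also shows that the minimal group cover need not be unique.

```python
from itertools import combinations


def group_l0_norm(x, groups):
    """Return (smallest cover size, list of all covers of that size) for supp(x)."""
    support = {i for i, xi in enumerate(x, start=1) if xi != 0}
    names = sorted(groups)
    for size in range(len(names) + 1):
        covers = [S for S in combinations(names, size)
                  if support <= set().union(*(groups[j] for j in S))]
        if covers:
            return size, covers
    return None, []   # only reached if the groups do not cover the support


# Example structure from Section II (1-based indices).
groups = {1: {1}, 2: {2}, 3: {1, 2, 3, 4, 5}, 4: {4, 6}, 5: {3, 5, 7}, 6: {6, 7, 8}}
# Hypothetical signal with support {4, 6, 7}.
x = [0.0, 0.0, 0.0, 1.0, 0.0, 2.0, -1.0, 0.0]

norm, minimal_covers = group_l0_norm(x, groups)
```

Here the signal is 2-group sparse, and three different pairs of groups attain the minimum, so the minimal group cover is not unique for this signal.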

III. Tractability of interpretations 

A group-based interpretation of a signal consists of identifying the groups that constitute the support of its 
approximation. In this section, we establish the hardness of finding group-based interpretations of signals in 
general and characterize a class of group structures that lead to tractable interpretations. In particular, we present a 
polynomial time algorithm that finds the correct $G$-group-support of the $G$-group-sparse approximation of $\mathbf{x}$, given 
a positive integer $G$ and the group structure $\mathfrak{G}$. 

We first define the $G$-group sparse approximation $\hat{\mathbf{x}}$ and then show that it can be easily obtained from its $G$-group 
cover $\mathcal{S}^G(\hat{\mathbf{x}})$, which is the solution of the model selection problem. We then reformulate the model selection 
problem as the weighted maximum coverage problem. Finally, we present our main result, the polynomial time 
dynamic program for loopless pairwise overlapping group structures. 

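Problem 1 (Signal approximation). Given a signal $\mathbf{x} \in \mathbb{R}^N$, a best $G$-group sparse approximation $\hat{\mathbf{x}}$ is given by 

$$\hat{\mathbf{x}} \in \operatorname*{argmin}_{\mathbf{z} \in \mathbb{R}^N} \big\{ \|\mathbf{x} - \mathbf{z}\|_2^2 : \|\mathbf{z}\|_{\mathfrak{G},0} \le G \big\}. \tag{2}$$

If we already know the $G$-group cover of the approximation $\mathcal{S}^G(\hat{\mathbf{x}})$, we can obtain $\hat{\mathbf{x}}$ as $\hat{\mathbf{x}}_{\mathcal{I}} = \mathbf{x}_{\mathcal{I}}$ and $\hat{\mathbf{x}}_{\mathcal{I}^c} = 0$, 
where $\mathcal{I} = \bigcup_{\mathcal{G} \in \mathcal{S}^G(\hat{\mathbf{x}})} \mathcal{G}$ and $\mathcal{I}^c = \mathcal{N} \setminus \mathcal{I}$. Therefore, we can solve Problem 1 by solving the following discrete 
problem. 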

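Problem 2 (Model selection). Given a signal $\mathbf{x} \in \mathbb{R}^N$, a $G$-group cover model for its $G$-group sparse approximation 
is expressed as follows 

$$\mathcal{S}^G(\hat{\mathbf{x}}) \in \operatorname*{argmax}_{\substack{\mathcal{S} \subseteq \mathfrak{G} \\ |\mathcal{S}| \le G}} \Big\{ \sum_{i \in \mathcal{I}} x_i^2 : \mathcal{I} = \bigcup_{\mathcal{G} \in \mathcal{S}} \mathcal{G} \Big\}. \tag{3}$$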

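To show the connection between the two problems, we first reformulate Problem 1 as 

$$\min_{\mathbf{z} \in \mathbb{R}^N} \Big\{ \|\mathbf{x} - \mathbf{z}\|_2^2 : \mathrm{supp}(\mathbf{z}) = \mathcal{I},\ \mathcal{I} = \bigcup_{\mathcal{G} \in \mathcal{S}} \mathcal{G},\ \mathcal{S} \subseteq \mathfrak{G},\ |\mathcal{S}| \le G \Big\},$$

which can be rewritten as 

$$\min_{\substack{\mathcal{S} \subseteq \mathfrak{G},\ |\mathcal{S}| \le G \\ \mathcal{I} = \bigcup_{\mathcal{G} \in \mathcal{S}} \mathcal{G}}} \ \min_{\substack{\mathbf{z} \in \mathbb{R}^N \\ \mathrm{supp}(\mathbf{z}) = \mathcal{I}}} \|\mathbf{x} - \mathbf{z}\|_2^2 .$$

The optimal solution is not changed if we introduce a constant, change the sign of the objective and consider 
maximization instead of minimization 

$$\max_{\substack{\mathcal{S} \subseteq \mathfrak{G},\ |\mathcal{S}| \le G \\ \mathcal{I} = \bigcup_{\mathcal{G} \in \mathcal{S}} \mathcal{G}}} \ \max_{\substack{\mathbf{z} \in \mathbb{R}^N \\ \mathrm{supp}(\mathbf{z}) = \mathcal{I}}} \big\{ \|\mathbf{x}\|_2^2 - \|\mathbf{x} - \mathbf{z}\|_2^2 \big\} .$$

The internal maximization is achieved for $\hat{\mathbf{x}}$ as $\hat{\mathbf{x}}_{\mathcal{I}} = \mathbf{x}_{\mathcal{I}}$ and $\hat{\mathbf{x}}_{\mathcal{I}^c} = 0$, so that we have, as desired, 

$$\mathcal{S}^G(\hat{\mathbf{x}}) \in \operatorname*{argmax}_{\substack{\mathcal{S} \subseteq \mathfrak{G},\ |\mathcal{S}| \le G \\ \mathcal{I} = \bigcup_{\mathcal{G} \in \mathcal{S}} \mathcal{G}}} \|\mathbf{x}_{\mathcal{I}}\|_2^2 .$$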
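
The equivalence between Problem 1 and Problem 2 can be checked numerically on a toy instance. The sketch below is an assumption-laden illustration, not the paper's algorithm: `best_group_approximation` is a hypothetical name, the signal is made up, and the search is exhaustive. It selects the groups that capture the most energy and then keeps the signal on their union.

```python
from itertools import combinations


def best_group_approximation(x, groups, G):
    """Exhaustively solve Problem 2, then build the approximation of Problem 1."""
    names = sorted(groups)
    best_value, best_S = 0.0, ()
    for size in range(1, G + 1):
        for S in combinations(names, size):
            I = set().union(*(groups[j] for j in S))
            value = sum(x[i - 1] ** 2 for i in I)    # energy captured by the union
            if value > best_value:
                best_value, best_S = value, S
    I = set().union(*(groups[j] for j in best_S))
    # Keep x on the union of the selected groups and set it to zero elsewhere.
    x_hat = [xi if i in I else 0.0 for i, xi in enumerate(x, start=1)]
    return best_S, x_hat


# Loopless pairwise overlapping groups from Section II and a hypothetical signal.
groups = {1: {1, 2, 3}, 2: {3, 4, 5}, 3: {5, 6, 7}}
x = [1.0, 2.0, 0.0, 3.0, 0.0, 1.0, 1.0]

S, x_hat = best_group_approximation(x, groups, G=2)
error = sum((a - b) ** 2 for a, b in zip(x, x_hat))
energy = sum(a ** 2 for a in x)
captured = sum(b ** 2 for b in x_hat)
```

On this instance the approximation error equals the total energy minus the captured energy, which is the constant shift used in the derivation.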

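The following reformulation of Problem 2 as a binary problem allows us to characterize its tractability. 

Lemma 1. Given $\mathbf{x} \in \mathbb{R}^N$ and a group structure $\mathfrak{G}$, we have that $\mathcal{S}^G(\hat{\mathbf{x}}) = \{\mathcal{G}_j \in \mathfrak{G} : \omega_j^G = 1\}$, where $(\boldsymbol{\omega}^G, \mathbf{y}^G)$ 
is an optimal solution of 

$$\max_{\boldsymbol{\omega} \in \mathbb{B}^M,\ \mathbf{y} \in \mathbb{B}^N} \Big\{ \sum_{i=1}^N y_i x_i^2 : \mathbf{A}^{\mathfrak{G}} \boldsymbol{\omega} \ge \mathbf{y},\ \sum_{j=1}^M \omega_j \le G \Big\}. \tag{4}$$

Proof: The proof follows along the same lines as the proof in [26]. Note that in (4), $\boldsymbol{\omega}$ and $\mathbf{y}$ are binary 
variables that specify which groups and which variables are selected, respectively. The constraint $\mathbf{A}^{\mathfrak{G}} \boldsymbol{\omega} \ge \mathbf{y}$ makes 
sure that for every selected variable at least one group is selected to cover it, while the constraint $\sum_{j=1}^M \omega_j \le G$ 
restricts choosing at most $G$ groups. ■ 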

Problem (4) can produce all the instances of the weighted maximum coverage problem (WMC), where the 
weights for each element are given by $x_i^2$ ($1 \le i \le N$) and the index sets are given by the groups $\mathcal{G}_j \in \mathfrak{G}$ 
($1 \le j \le M$). Since WMC is in general NP-hard and given Lemma 1, the tractability of (3) directly depends on 
the hardness of (4), which leads to the following result. 

Proposition III.1. The model selection problem (3) is in general NP-hard. 

It is possible to approximate the solution of (4) using the greedy WMC algorithm [27]. At each iteration, the 
algorithm selects the group that covers new variables with maximum combined weight until $G$ groups have been 
selected. However, we show next that for certain group structures we can find an exact solution. 
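
The greedy rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `greedy_wmc`, the tie-breaking rule (smallest group index) and the toy instance with three overlapping groups are all choices made here, not taken from [27].

```python
def greedy_wmc(x, groups, G):
    """Greedy weighted maximum coverage with weights x_i^2 and at most G groups."""
    covered, chosen = set(), []
    for _ in range(G):
        # Weight of the not-yet-covered variables that each remaining group would add.
        gains = {j: sum(x[i - 1] ** 2 for i in groups[j] - covered)
                 for j in groups if j not in chosen}
        if not gains:
            break
        j_best = max(sorted(gains), key=gains.get)   # ties go to the smallest index
        if gains[j_best] <= 0:
            break
        chosen.append(j_best)
        covered |= groups[j_best]
    return chosen, sum(x[i - 1] ** 2 for i in covered)


# Hypothetical instance: three groups of five variables, consecutive groups overlap.
groups = {1: set(range(1, 6)), 2: set(range(4, 9)), 3: set(range(7, 12))}
x = [0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0]

chosen, value = greedy_wmc(x, groups, G=2)
# Weight covered by the first and third group together, for comparison.
optimal_value = sum(x[i - 1] ** 2 for i in groups[1] | groups[3])
```

On this instance the greedy rule first picks the middle group and ends with a smaller covered weight than the pair formed by the two outer groups, which illustrates why it only approximates (4).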

Our main result is an algorithm for solving (4) for loopless pairwise overlapping group structures. The proof 
is given in Appendix A. 

Proposition III.2. Given a loopless pairwise overlapping group structure $\mathfrak{G}$, there exists a polynomial time dynamic 
programming algorithm that solves (4). 



IV. Discrete relaxations 

Relaxations are useful techniques that allow us to obtain approximate, or even sometimes exact, solutions while 
being computationally less demanding. In our case, by relaxing the constraint on the number of groups in (4) into 
a regularization term with parameter $\lambda > 0$, we obtain the following binary linear program 

$$(\boldsymbol{\omega}^\lambda, \mathbf{y}^\lambda) \in \operatorname*{argmax}_{\boldsymbol{\omega} \in \mathbb{B}^M,\ \mathbf{y} \in \mathbb{B}^N} \Big\{ \sum_{i=1}^N y_i x_i^2 - \lambda \sum_{j=1}^M \omega_j : \mathbf{A}^{\mathfrak{G}} \boldsymbol{\omega} \ge \mathbf{y} \Big\}. \tag{5}$$

We can rewrite the previous program in standard form. Let $\mathbf{u}^T = [\mathbf{y}^T\ \boldsymbol{\omega}^T] \in \mathbb{B}^{N+M}$, $\mathbf{w}^T = [x_1^2, \ldots, x_N^2, -\lambda \mathbf{1}_M^T]$ 
and $\mathbf{C} = [\mathbf{I}_N, -\mathbf{A}^{\mathfrak{G}}]$. We then have that (5) is equivalent to 

$$\mathbf{u}^\lambda \in \operatorname*{argmax}_{\mathbf{u} \in \mathbb{B}^{N+M}} \big\{ \mathbf{w}^T \mathbf{u} : \mathbf{C} \mathbf{u} \le \mathbf{0} \big\}. \tag{6}$$

In general, (6) is NP-hard; however, it is well known [25] that if the constraint matrix $\mathbf{C}$ is Totally Unimodular 
(TU), then it can be solved in polynomial time. While the concatenation of two TU matrices is not TU in general, 
the concatenation of the identity matrix with a TU matrix results in a TU matrix. Thus, due to its structure, $\mathbf{C}$ is 
TU if $\mathbf{A}^{\mathfrak{G}}$ is TU [25]. 

Group structures that can be represented by a bipartite graph, such as the one in Fig. 2, lead to constraint 
matrices $\mathbf{A}^{\mathfrak{G}}$ that are TU [25]. 

Lemma 2. Loopless pairwise overlapping groups lead to totally unimodular constraints. 

Proof: We first use a result that establishes that if a matrix is TU, then its transpose is also TU [25, Prop. 2.1]. 
We then apply [25, Corollary 2.8] to $\mathbf{A}^{\mathfrak{G}}$, swapping the roles of rows and columns. Given a binary matrix whose 
columns can be partitioned into two disjoint sets and with no more than two nonzero elements in each row, this 
result provides two sufficient conditions for it being totally unimodular. In our case, the columns of $\mathbf{A}^{\mathfrak{G}}$ can be 
partitioned into two sets, $S_1$ and $S_2$, because the group graph for loopless pairwise overlapping groups is bipartite. 
The two sets represent groups which have no common overlap. Furthermore, each row of $\mathbf{A}^{\mathfrak{G}}$ contains at most two 
nonzero entries due to the pairwise overlap. We can now easily check that the two conditions on $\mathbf{A}^{\mathfrak{G}}$ are satisfied: 

• If two nonzero entries in a row have the same sign, then the column of one is in $S_1$ and the other is in $S_2$: 
indeed if an element belongs to two groups, these groups must lie in two different sets; 

• If two nonzero entries in a row have opposite signs, then their columns are both in $S_1$ or both in $S_2$: there 
are no such rows in our case. 
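
For very small group structures, total unimodularity can also be verified directly from its definition: every square submatrix must have determinant in $\{-1, 0, 1\}$. The sketch below does exactly that by brute force. It is only a sanity check on toy instances (the cost grows combinatorially), and the helper names and the triangle counterexample are assumptions made here for illustration.

```python
from itertools import combinations


def det(m):
    """Integer determinant by Laplace expansion along the first row (small matrices only)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for c, v in enumerate(m[0]):
        if v:
            minor = [row[:c] + row[c + 1:] for row in m[1:]]
            total += (-1) ** c * v * det(minor)
    return total


def is_totally_unimodular(a):
    """Check that every square submatrix of a has determinant -1, 0 or 1."""
    n_rows, n_cols = len(a), len(a[0])
    for k in range(1, min(n_rows, n_cols) + 1):
        for rows in combinations(range(n_rows), k):
            for cols in combinations(range(n_cols), k):
                sub = [[a[r][c] for c in cols] for r in rows]
                if det(sub) not in (-1, 0, 1):
                    return False
    return True


def biadjacency(groups, n_vars):
    """A[i][j] = 1 if variable i+1 belongs to the (j+1)-th group, else 0."""
    order = sorted(groups)
    return [[1 if i in groups[j] else 0 for j in order]
            for i in range(1, n_vars + 1)]


# Loopless pairwise overlapping groups (a chain of three groups).
chain = biadjacency({1: {1, 2, 3}, 2: {3, 4, 5}, 3: {5, 6, 7}}, 7)
# Hypothetical counterexample: three groups whose group graph is a triangle (odd loop).
triangle = biadjacency({1: {1, 2}, 2: {2, 3}, 3: {1, 3}}, 3)
```

The chain structure passes the check, while the triangle structure contains a 3 x 3 submatrix with determinant 2 and therefore fails it.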



Even though for this group structure we can use the dynamic program of Prop. III.2, for very large problems 
it may be computationally faster to solve the binary linear program. The next lemma establishes when the 
regularized solution coincides with the solution of (4). 

Lemma 3. If the value of the regularization parameter $\lambda$ is such that the solution $(\boldsymbol{\omega}^\lambda, \mathbf{y}^\lambda)$ of (5) satisfies 
$\sum_j \omega_j^\lambda = G$, then $(\boldsymbol{\omega}^\lambda, \mathbf{y}^\lambda)$ is also a solution for (4). 

Proof: This lemma is a direct consequence of Prop. IV.1 below. 



However, as we numerically show in Section VIII, given a value of $G$ it is not always possible to find a value 
of $\lambda$ such that the solution of (5) is also a solution for (4). Let the set of points $\mathcal{P} = \{(G, f(G))\}_{G=1}^{M}$, where 
$f(G) = \sum_{i=1}^N y_i^G x_i^2$, be the Pareto frontier of (4). We then have the following characterization of the solutions of 
the discrete relaxation. 

Proposition IV.1. The discrete relaxation (5) yields only the solutions that lie on the intersection between the 
Pareto frontier of (4), $\mathcal{P}$, and the boundary of the convex hull of $\mathcal{P}$. 

Proof: On the one hand, the solutions of (4) for all possible values of $G$ are the Pareto optimal solutions of 
the following vector-valued minimization problem with respect to the positive orthant of $\mathbb{R}^2$, which we denote by 
$\mathbb{R}^2_+$, 

$$\min_{\boldsymbol{\omega} \in \mathbb{B}^M,\ \mathbf{y} \in \mathbb{B}^N} \mathbf{f}(\boldsymbol{\omega}, \mathbf{y}) \quad \text{subject to } \mathbf{A}^{\mathfrak{G}} \boldsymbol{\omega} \ge \mathbf{y}, \tag{7}$$

where $\mathbf{f}(\boldsymbol{\omega}, \mathbf{y}) = \big( \|\mathbf{x}\|_2^2 - \sum_{i=1}^N y_i x_i^2,\ \sum_{j=1}^M \omega_j \big) \in \mathbb{R}^2_+$. 

On the other hand, the scalarization of (7) yields the following discrete problem, with $\lambda > 0$, 

$$\min_{\boldsymbol{\omega} \in \mathbb{B}^M,\ \mathbf{y} \in \mathbb{B}^N} \|\mathbf{x}\|_2^2 - \sum_{i=1}^N y_i x_i^2 + \lambda \sum_{j=1}^M \omega_j \quad \text{subject to } \mathbf{A}^{\mathfrak{G}} \boldsymbol{\omega} \ge \mathbf{y}, \tag{8}$$

whose solutions are the same as for (5). Therefore, the relationship between the solutions of (4) and (5) can be 
inferred from the relationship between the solutions of (7) and (8). It is known that the solutions of (8) are also 
Pareto optimal solutions of (7), but only the Pareto optimal solutions of (7) that admit a supporting hyperplane 
for the feasible objective values of (7) are also solutions of (8) [28]. In other words, the solutions obtainable via 
scalarization belong to the intersection of the Pareto optimal solution set and the boundary of its convex hull. 
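
The gap between the budgeted problem (4) and its penalized relaxation (5) can be observed on a small instance. The sketch below is an illustration built on assumptions: the four groups `P1`, `P2`, `P3`, `Q`, the unit weights and the grid of penalty values are all hypothetical, and both problems are solved by exhaustive search rather than by the paper's algorithms.

```python
from itertools import combinations

# Hypothetical loopless pairwise overlapping structure: Q overlaps P1, P2 and P3.
groups = {
    "P1": {1, 2, 3, 4},
    "P2": {5, 6, 7, 8},
    "P3": {9, 10, 11, 12},
    "Q": {1, 2, 3, 5, 6, 7, 9, 10, 11},
}
weights = {i: 1.0 for i in range(1, 13)}      # plays the role of x_i^2

subsets = [S for k in range(len(groups) + 1)
           for S in combinations(sorted(groups), k)]


def coverage(S):
    """Total weight of the variables covered by the groups in S."""
    I = set().union(*(groups[j] for j in S))
    return sum(weights[i] for i in I)


def pareto_frontier():
    """f(G) for G = 0..M: best coverage with at most G groups."""
    f, best = [], 0.0
    for G in range(len(groups) + 1):
        best = max([best] + [coverage(S) for S in subsets if len(S) == G])
        f.append(best)
    return f


def groups_selected_by_penalty(lam):
    """Number of groups in a maximizer of coverage(S) - lam * |S|."""
    return len(max(subsets, key=lambda S: coverage(S) - lam * len(S)))


frontier = pareto_frontier()
lambdas = [0.25, 0.5, 1.0, 1.25, 2.0, 4.0, 8.0, 10.0]
reachable = sorted({groups_selected_by_penalty(lam) for lam in lambdas})
```

In this instance the point with two groups lies strictly below the segment joining its neighbours on the frontier, so no penalty value in the grid selects exactly two groups, in line with Proposition IV.1.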



Fig. 4. The group-graph for the example in Section VI. 



V. Convex relaxations 



For tractability and analysis, convex proxies to the group $\ell_0$-norm have been proposed (e.g., [22]) for finding 
group-sparse approximations of signals. Given a group structure $\mathfrak{G}$, an example generalization is defined as 

$$\|\mathbf{x}\|_{\mathfrak{G},\{1,p\}} := \inf_{\substack{\mathbf{v}^1, \ldots, \mathbf{v}^M \in \mathbb{R}^N \\ \forall j,\ \mathrm{supp}(\mathbf{v}^j) = \mathcal{G}_j}} \Big\{ \sum_{j=1}^M d_j \|\mathbf{v}^j\|_p : \sum_{j=1}^M \mathbf{v}^j = \mathbf{x} \Big\}, \tag{9}$$

where $\|\mathbf{x}\|_p = \big( \sum_{i=1}^N |x_i|^p \big)^{1/p}$ is the $\ell_p$-norm, and $d_j$ are positive weights that can be designed to favor certain 
groups over others [11]. This norm can be seen as a weighted instance of the atomic norm described in [8], where 
the authors leverage convex optimization for signal recovery, but not for model selection. 

One can in general use (9) to find a group-sparse approximation under the chosen group norm 

$$\hat{\mathbf{x}} \in \operatorname*{argmin}_{\mathbf{z} \in \mathbb{R}^N} \big\{ \|\mathbf{x} - \mathbf{z}\|_2^2 : \|\mathbf{z}\|_{\mathfrak{G},\{1,p\}} \le \lambda \big\}, \tag{10}$$

where $\lambda > 0$ controls the trade-off between approximation accuracy and group-sparsity. However, solving (10) does 
not yield a group-support for $\hat{\mathbf{x}}$: even though we can recover one through the decomposition $\{\mathbf{v}^j\}$ used to compute 
$\|\hat{\mathbf{x}}\|_{\mathfrak{G},\{1,p\}}$, it may not be unique, as observed in [11] for $p = 2$. In order to characterize the group-support for $\mathbf{x}$ 
induced by (9), in [11] the authors define two group-supports for $p = 2$. The strong group-support $\mathcal{S}_{\mathrm{strong}}(\mathbf{x})$ contains the 
groups that constitute the supports of each decomposition used for computing (9). The weak group-support $\mathcal{S}_{\mathrm{weak}}(\mathbf{x})$ is 
defined using a dual characterization of the group norm (9). If $\mathcal{S}_{\mathrm{strong}}(\mathbf{x}) = \mathcal{S}_{\mathrm{weak}}(\mathbf{x})$, the group-support is uniquely defined. 
However, [11] observed that for some group structures and signals $\mathbf{x}$, even when $\mathcal{S}_{\mathrm{strong}}(\mathbf{x}) = \mathcal{S}_{\mathrm{weak}}(\mathbf{x})$, the group-support 
does not capture the minimal group-cover of $\mathbf{x}$. Hence, the equivalence of $\ell_0$ and $\ell_1$ minimization [3], [4] in the 
standard compressive sensing setting does not hold in the group-based sparsity setting. 



VI. Case study: discrete vs. convex interpretability 



The following stylized example illustrates situations that can potentially be encountered in practice. In these 
cases, the group-support obtained by the convex relaxation will not coincide with the discrete definition of group-cover, 
while the dynamic programming algorithm of Prop. III.2 is able to recover the correct group-cover. 

Let $\mathcal{N} = \{1, \ldots, 11\}$ and let $\mathfrak{G} = \{\mathcal{G}_1 = \{1, \ldots, 5\},\ \mathcal{G}_2 = \{4, \ldots, 8\},\ \mathcal{G}_3 = \{7, \ldots, 11\}\}$ be the loopless 
pairwise overlapping group structure with 3 groups of equal cardinality. Its group graph is represented in Fig. 4. 
Consider the 2-group sparse signal $\mathbf{x} = [0\ 0\ 1\ 1\ 1\ 0\ 1\ 1\ 1\ 0\ 0]^T$, with minimal group-cover $\mathcal{M}(\mathbf{x}) = \{\mathcal{G}_1, \mathcal{G}_3\}$. 

The dynamic program of Prop. III.2 with group budget $G = 2$ correctly identifies the groups $\mathcal{G}_1$ and $\mathcal{G}_3$. The 
TU linear program (5), with $0 < \lambda < 2$, also yields the correct group-cover. Conversely, the decomposition obtained 
via (9) with unitary weights is unique, but is not group sparse. In fact, we have $\mathcal{S}_{\mathrm{strong}}(\mathbf{x}) = \mathcal{S}_{\mathrm{weak}}(\mathbf{x}) = \mathfrak{G}$. We can only 
obtain the correct group-cover if we use the weights $[1\ d\ 1]$ with $d > 2/\sqrt{3}$, that is, knowing beforehand that $\mathcal{G}_2$ is 
irrelevant. 
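
The discrete side of this case study is small enough to be reproduced by exhaustive search. The sketch below is an illustration only: it replaces the dynamic program and the TU linear program by brute-force enumeration, and the function names `budgeted` and `penalized` are hypothetical. It does not attempt to solve the convex program (9).

```python
from itertools import combinations

# The three groups and the signal of the case study (1-based indices).
groups = {1: set(range(1, 6)), 2: set(range(4, 9)), 3: set(range(7, 12))}
x = [0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0]

all_subsets = [S for k in range(len(groups) + 1)
               for S in combinations(sorted(groups), k)]


def covered_energy(S):
    """Sum of x_i^2 over the union of the groups in S."""
    I = set().union(*(groups[j] for j in S))
    return sum(x[i - 1] ** 2 for i in I)


def budgeted(G):
    """Exhaustive solution of the budgeted problem (4)."""
    return max((S for S in all_subsets if len(S) <= G), key=covered_energy)


def penalized(lam):
    """Exhaustive solution of the penalized problem (5)."""
    return max(all_subsets, key=lambda S: covered_energy(S) - lam * len(S))
```

With a budget of two groups the search returns the first and third group, and the penalized problem returns the same pair for penalty values between 0 and 2, for instance `penalized(1.0)`. For a larger penalty such as 2.5, only the middle group is selected.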



Remark 1. Indeed, if the convex relaxation always recovered the correct minimal group-cover, it would be possible 
to solve the discrete NP-hard problem in polynomial time. 

Fig. 5. Hierarchical constraints. Each node represents a variable. (Left) A valid selection of nodes. (Right) An invalid selection of nodes. 



VII. Generalizations 

In this section, we first present a generalization of the discrete approximation problem (4) by introducing an 
additional overall sparsity constraint. Secondly, we show how this generalization encompasses approximation with 
hierarchical constraints that can be solved exactly via dynamic programming. Finally, we show that the generalized 
problem can be relaxed into a linear binary problem and that hierarchical constraints lead to totally unimodular 
matrices for which there exist efficient polynomial time solvers. 



A. Sparsity within groups 

In many applications, for example genome-wide association studies [17], it is desirable to find approximations 
that are not only group-sparse, but also sparse in the usual sense (see [29] for an extension of the group lasso). To 
this end, we generalize our original problem (4) by introducing a sparsity constraint $K$ and allowing the individual 
selection of variables within a group. The generalized integer problem then becomes 

$$\max_{\boldsymbol{\omega} \in \mathbb{B}^M,\ \mathbf{y} \in \mathbb{B}^N} \Big\{ \sum_{i=1}^N y_i x_i^2 : \mathbf{A}^{\mathfrak{G}} \boldsymbol{\omega} \ge \mathbf{y},\ \sum_{i=1}^N y_i \le K,\ \sum_{j=1}^M \omega_j \le G \Big\}. \tag{11}$$

This problem is in general NP-hard too, but it turns out that it can be solved in polynomial time for the same 
group structures that allow us to solve (4). 

Proposition VII.1. Given a loopless pairwise overlapping group structure $\mathfrak{G}$, there exists a polynomial time 
dynamic programming algorithm that solves (11). 

Proof: The dynamic program is described in Appendix A alongside the proof that it has a polynomial running 
time. ■ 
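
A brute-force reading of (11) may help to see how the two budgets interact: once a set of at most $G$ groups is fixed, the best choice is to keep the $K$ largest-magnitude entries inside their union. The sketch below follows this observation. It is not the dynamic program of Appendix A; the function name `group_and_element_sparse` and the test signal are assumptions made for illustration.

```python
from itertools import combinations


def group_and_element_sparse(x, groups, G, K):
    """Exhaustive solution of (11): at most G groups and at most K selected variables."""
    best = (0.0, (), [])
    for size in range(1, G + 1):
        for S in combinations(sorted(groups), size):
            I = set().union(*(groups[j] for j in S))
            # Within the union, keeping the K largest entries is optimal.
            kept = sorted(I, key=lambda i: -x[i - 1] ** 2)[:K]
            value = sum(x[i - 1] ** 2 for i in kept)
            if value > best[0]:
                best = (value, S, sorted(kept))
    return best


# Hypothetical instance: three overlapping groups of five variables each.
groups = {1: set(range(1, 6)), 2: set(range(4, 9)), 3: set(range(7, 12))}
x = [0.0, 0.0, 2.0, 1.0, 1.0, 0.0, 1.0, 1.0, 2.0, 0.0, 0.0]

value, selected_groups, selected_vars = group_and_element_sparse(x, groups, G=2, K=2)
```

With two groups and two variables allowed, the search keeps the two largest entries, which lie in the first and third group respectively.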



B. Hierarchical constraints 

The generalized model allows us to deal with hierarchical structures, such as regular trees, frequently encountered 
in image processing (e.g., denoising using wavelet trees). In such cases, we often need to find $K$-sparse approximations 
such that the selected variables form a rooted connected subtree of the original tree, see Fig. 5. Given a 
tree $\mathcal{T}$, the rooted-connected approximation can be cast as the solution of the following discrete problem 

$$\max_{\mathbf{y} \in \mathbb{B}^N} \Big\{ \sum_{i=1}^N y_i x_i^2 : \mathrm{supp}(\mathbf{y}) \in \mathcal{T}_K \Big\}, \tag{12}$$

where $\mathcal{T}_K$ denotes all rooted and connected subtrees of the given tree $\mathcal{T}$ with at most $K$ nodes. 

This type of constraint can be represented by a group structure, where for each node in the tree we define a 
group consisting of that node and all its ancestors. When a group is selected, we require that all its elements are 
selected as well. We impose an overall sparsity constraint $K$, while discarding the group constraint $G$. 

For this particular problem, for which convex approximations have been proposed [30], we present an exact 
dynamic program that runs in polynomial time. 

Proposition VII.2. Given a hierarchical group structure $\mathfrak{G}$, there exists a polynomial time dynamic programming 
algorithm that solves (12). 

Proof: The description of the algorithm and the proof of its polynomial running time can be found in Appendix B. 
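
To give a feel for why (12) is tractable, here is a standard tree-knapsack style dynamic program. It is a sketch under assumptions and not necessarily the algorithm of Appendix B: the function name `best_rooted_subtree`, the data layout (`weights[v]` standing for $x_v^2$, `children` as an adjacency dictionary) and the toy tree are all hypothetical.

```python
def best_rooted_subtree(weights, children, root, K):
    """Best rooted connected subtree with at most K nodes, as (value, node set)."""
    NEG = float("-inf")

    def solve(v):
        # table[k] = (best value, nodes) using exactly k nodes of v's subtree,
        # with v itself included whenever k >= 1.
        table = [(0.0, frozenset()), (weights[v], frozenset([v]))]
        for c in children.get(v, ()):
            child = solve(c)
            size = min(K, len(table) + len(child) - 2)
            merged = table + [(NEG, frozenset())] * (size + 1 - len(table))
            for a in range(1, len(table)):          # nodes already taken around v
                for b in range(1, len(child)):      # nodes taken in the child's subtree
                    if a + b > size:
                        break
                    value = table[a][0] + child[b][0]
                    if value > merged[a + b][0]:
                        merged[a + b] = (value, table[a][1] | child[b][1])
            table = merged
        return table

    return max(solve(root)[: K + 1], key=lambda entry: entry[0])


# Hypothetical binary tree with 7 nodes; weights play the role of x_i^2.
children = {1: [2, 3], 2: [4, 5], 3: [6, 7]}
weights = {1: 1.0, 2: 0.25, 3: 2.0, 4: 9.0, 5: 0.25, 6: 0.5, 7: 3.0}

value3, nodes3 = best_rooted_subtree(weights, children, root=1, K=3)
value2, nodes2 = best_rooted_subtree(weights, children, root=1, K=2)
```

On this tree the best subtree with three nodes is not an extension of the best subtree with two nodes, so solutions for different budgets need not be nested.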



C. Additional discrete relaxations 



By relaxing both the group budget and the sparsity budget in ( flT) into regularization terms, we obtain the 
following binary linear program 

0\y A )€ argmax |w T u : u T = [y T u T y T ], Cu < o) (13) 

where w T = [x\ , . . . , x 2 N , — AgI^ , — ^kIJj] and C = [Ijv, —A®, On] and \g,^k > are two regularization 
parameters that indirectly control the number of active groups and the number of selected elements. ( p"3"] ) can 
be solved in polynomial time if the constraint matrix C is totally unimodular. Due to its structure, C is totally 
unimodular if A & is totally unimodular [25]. The next results proves that the constraint matrix of hierarchical group 
structures is totally unimodular. 

Proposition VII.3. Hierarchical group structures lead to totally unimodular constraints. 

Proof: We use the fact that a binary matrix is totally unimodular if there exists a permutation of its columns 
such that in each row the 1s appear consecutively [25]. For hierarchical group structures, such a permutation is given 
by a depth-first ordering of the groups. In fact, a variable is included in the group that has it as the leaf and in 
all the groups that contain its descendants. Given a depth-first ordering of the groups, the groups that contain the 
descendants of a given node will be consecutive. ■ 
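
The consecutive-ones argument can be checked on a small tree. The following is a minimal sketch, not the authors' code: the helper names (`preorder`, `group_matrix`, `has_consecutive_ones`) and the dictionary-based tree encoding are assumptions made for illustration. It builds the variable-by-group incidence matrix for groups made of a node and all its ancestors, and tests whether each row has its 1s in consecutive columns for a given column ordering.

```python
def preorder(children, root):
    """Depth-first (preorder) listing of the nodes of a rooted tree."""
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(reversed(children.get(v, [])))
    return order

def group_matrix(children, root, column_order):
    """Rows: variables (nodes). Columns: groups, one per node j, containing
    j and all its ancestors. Entry is 1 if the variable belongs to the group."""
    parent = {c: p for p, cs in children.items() for c in cs}
    groups = {}
    for j in preorder(children, root):
        members, v = set(), j
        while v is not None:          # walk up to the root
            members.add(v)
            v = parent.get(v)
        groups[j] = members
    nodes = preorder(children, root)
    return [[1 if i in groups[j] else 0 for j in column_order] for i in nodes]

def has_consecutive_ones(matrix):
    """True if, in every row, the 1s occupy a contiguous block of columns."""
    for row in matrix:
        idx = [c for c, v in enumerate(row) if v]
        if idx and idx[-1] - idx[0] + 1 != len(idx):
            return False
    return True

# Hypothetical complete binary tree with 7 nodes.
tree = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
dfs_cols = preorder(tree, 0)
bfs_cols = [0, 1, 2, 3, 4, 5, 6]
dfs_ok = has_consecutive_ones(group_matrix(tree, 0, dfs_cols))
bfs_ok = has_consecutive_ones(group_matrix(tree, 0, bfs_cols))
```

In this toy example the depth-first column order exhibits the consecutive-ones property while the breadth-first order does not. The latter only means that the breadth-first order fails to reveal the property, not that the matrix is not totally unimodular.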

VIII. Pareto Frontier Example 

The purpose of this numerical simulation is to illustrate the limitations of relaxations for correctly estimating 
the G-group cover of an approximation. We consider the problem of finding a K-sparse approximation of a signal 
imposing hierarchical constraints. We generate a piecewise constant signal of length N = 64, to which we apply 
the Haar wavelet transformation, yielding a 25-sparse vector of coefficients x that satisfies hierarchical constraints 
on a binary tree of depth 5, see Fig. 6 (Left). 

We compare the proposed dynamic program (DP) to the regularized totally unimodular linear program approach 
and two convex relaxations that use group-based norms. The first [8] uses the Latent Group Lasso penalty (10) with 
groups defined as all father-child relations in the tree. This formulation will not enforce all hierarchical constraints 
to be satisfied, but will only 'favor' them. Therefore, we also report the number of hierarchical constraint violations. 



The second [30] considers a hierarchy of groups where $\mathcal{G}_j$ contains node $j$ and all its descendants. Hierarchical 
constraints are enforced by the group lasso penalty $\Omega_{GL}(x) = \sum_{\mathcal{G} \in \mathfrak{G}} \|x_{\mathcal{G}}\|_2$, where $x_{\mathcal{G}}$ is the restriction of $x$ to $\mathcal{G}$. 
We call this method Hierarchical Group Lasso. Once we determine the support of the solution, we assign to the 
components in the support the values of the corresponding components of the original signal. 

In Fig. 6 (Right), we show the approximation error $\|x - \hat{x}\|_2^2$ as a function of the solution sparsity $K$ for the 
methods. The values of the DP solutions form the discrete Pareto frontier of the optimization problem controlled 
by the parameter $K$. Note that there are points in the Pareto frontier that do not lie on its convex hull, hence these 
solutions are not achievable by the TU linear relaxation. We observe that the Hierarchical Group Lasso¹ also misses 
the solutions for $K = 21$ and $K = 23$, while the Latent Group Lasso² approach achieves more levels of sparsity 
(but still missing the solutions for $K = 2$, 13 and 15), although at the price of violating some of the hierarchical 
constraints. These observations lead us to conclude that, in some cases, relaxations of the original discrete problem 
might not be able to find the correct group-based interpretation of a signal. 
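
To make the convex-hull remark concrete, the sketch below uses toy (K, error) pairs that are purely illustrative and are not taken from the experiment in Fig. 6. The function names are hypothetical. It computes the lower convex hull of a discrete frontier and shows that a point off the hull is never the minimizer of a penalized objective error + λK, for any of the tested λ ≥ 0.

```python
def lower_convex_hull(points):
    """Points on the lower convex hull, scanned by increasing first coordinate."""
    hull = []
    for p in sorted(points):
        # Drop the last hull point while it lies on or above the segment hull[-2] -> p.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross > 0:
                break
            hull.pop()
        hull.append(p)
    return hull

def penalized_choice(points, lam):
    """Frontier point minimizing error + lam * K (a scalarized objective)."""
    return min(points, key=lambda p: p[1] + lam * p[0])

# Toy frontier (assumption): (sparsity K, approximation error).
frontier = [(0, 10.0), (1, 6.0), (2, 5.0), (3, 1.0), (4, 0.0)]
on_hull = [k for k, _ in lower_convex_hull(frontier)]
chosen = {penalized_choice(frontier, l / 10.0)[0] for l in range(0, 200)}
```

Here `on_hull` omits K = 2, and the set `chosen` of sparsity levels selected over the λ grid never contains K = 2, mirroring the way a regularized relaxation can skip points of the discrete Pareto frontier.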



1 We used the code provided in http://spams-devel.gforge.inria.fr/. 

2 We used the duplication of variables approach and solved the resulting Group Lasso problem using SpaRSA: http://www.lx.it.pt/~mtf/SpaRSA/ 

[Figure: Signal Approximation on Wavelet Tree] 

Fig. 6. (Left) (a) Original piecewise constant signal, (b) Haar wavelet representation. (Right) Signal approximation on a binary tree. The 
original signal is 25-sparse and satisfies hierarchical constraints. The numbers next to the Latent Group Lasso solutions indicate the number 
of constraint violations. 

Fig. 7. Characterization of tractability for group-based interpretations. 



IX. Conclusions 

Many applications benefit from group sparse representations. Unfortunately, our results in this paper show that 
finding a group-based interpretation of a signal is an integer optimization problem, which is in general NP-hard. 
Nevertheless, we characterize group structures for which a dynamic programming algorithm can find a solution in 
polynomial time and also delineate discrete relaxations for special structures (i.e., totally unimodular constraints) 
that can obtain correct solutions in special circumstances. 

Our examples and numerical simulations show the deficiencies of convex relaxations. We observe that such 
methods only recover group-covers that lie in the convex hull of the Pareto frontier determined by the solutions of 
the original integer problem for different values of the group budget G (and sparsity budget K for the generalized 
model). This, in turn, implies that convex and non-convex relaxations might miss some important groups or include 
spurious ones in the group-sparse model selection. We summarize our findings in Fig. 7. 

Appendix A 

Dynamic programming for solving (11) for loopless pairwise overlapping groups 

Here, we give the proof of Prop. VII.1. The proof of Prop. III.2 follows along similar lines. 

Proof: The proof consists in describing the algorithm and showing that it runs in polynomial time. Problem (11) can be 
equivalently described by the following problem: given a signal $x \in \mathbb{R}^N$ and a group structure consisting of $M$ 
groups defined over the index set $\mathcal{N} = \{1, \dots, N\}$, with each index having an associated (non-negative) weight 
(i.e., $x_i^2$, $\forall i \in \mathcal{N}$), find the optimal selection of at most $K$ indices, to maximize the sum of their weights, such 
that the indices are contained in a union of at most $G$ groups. 

We highlight the optimal substructure inherent in this problem, which allows us to solve it using a dynamic 
programming approach. The optimal substructure is somewhat involved: we represent it below by two properties. 

1) Suppose we know the G groups that constitute the optimal solution. Then the optimal choice of elements 
corresponds to picking the K largest weight elements belonging to the union of the G groups. 

2) Suppose we know the groups and elements selected in the optimal solution under a $G$-groups and $K$-elements 
constraint. Partition the set of chosen groups into two sets, $S_1$ and $S_2$, consisting of $g_1$ and $g_2$ groups respectively 
($g_1 + g_2 = G$). Suppose $S_1$ contains $k_1$ of the elements in the optimal solution, and suppose $S_2$ contains 
additional $k_2$ elements excluding elements covered by $S_1$ ($k_1 + k_2 = K$). Then, given the selection of groups 
and elements in $S_1$, $S_2$ represents the optimal selection of at most $k_2$ elements from at most $g_2$ groups in $S_1^c$ 
(i.e., $\mathfrak{G} \setminus S_1$), after the elements in $S_1$ have been removed from the groups in $S_1^c$. 
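
Property 1 already yields a simple, exponential-time reference solver that is useful for checking faster methods on small instances. The sketch below is an illustration under stated assumptions, not the dynamic program described in this appendix: the function names and the toy weights and groups are hypothetical. It enumerates every selection of at most G groups and, for each, keeps the K largest weights in their union.

```python
from itertools import combinations

def top_k_value(weights, elements, k):
    """Property 1: the best choice of at most k elements from a fixed set
    is the k largest (non-negative) weights."""
    return sum(sorted((weights[i] for i in elements), reverse=True)[:k])

def brute_force_group_cover(weights, groups, G, K):
    """Exhaustive search over all selections of at most G groups.
    Exponential in the number of groups; for small sanity checks only."""
    best = 0.0
    for g in range(1, G + 1):
        for chosen in combinations(groups, g):
            union = set().union(*chosen)
            best = max(best, top_k_value(weights, union, K))
    return best

# Toy instance (assumption): weights play the role of the squared entries x_i^2.
weights = [4.0, 1.0, 3.0, 2.0, 5.0]
groups = [{0, 1}, {1, 2}, {3, 4}]
value_two_groups = brute_force_group_cover(weights, groups, G=2, K=3)
value_one_group = brute_force_group_cover(weights, groups, G=1, K=3)
```

With two groups allowed, the best cover in this toy instance is the union of the first and third groups, from which the three largest weights are kept.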

These properties lead us to a dynamic programming based method for obtaining the solution. The underlying 
graph has as nodes the groups in $\mathfrak{G}$. The algorithm gradually explores every node in the group-graph, storing the 
optimal solution among the visited nodes, and it is defined by two rules: the Node Picking Rule controls how the 
graph must be explored in order to minimize the number of values to store; the Value Update Rule describes how 
the stored values are updated when a new node is considered. Due to the looplessness constraint, the graph can be 
represented as a tree or a forest. Choose an arbitrary node and call it the root node. 

Let $\mathfrak{G}$ be the set of all nodes and let $S \subseteq \mathfrak{G}$ be the set of currently explored nodes with $|S| = m$. Define 
3-valued indicator variables, $I_1, I_2, \dots, I_M$, for each of the $M$ nodes. $I_j = 1$ indicates that the $j$th node is selected, 
$I_j = 0$ shows that it is forbidden, while $I_j = 2$ represents a "don't care" state: there is no restriction on the $j$th 
node being either chosen or not. For unexplored nodes, the indicator variables are always in the "don't care" state. 
At every step of the algorithm, we store the optimal value for choosing at most $k$ elements, from at most $g$ nodes, 
$1 \le k \le K$ and $1 \le g \le G$, from the currently explored set of nodes, $S$. These optimal values are stored in the 
function $F(S, g, k, I_1, I_2, \dots, I_M)$. 

We define an additional function H(S, k) to represent the optimal value for choosing k elements from a set S. 
The set S could be a single group, a union of groups, or any well-defined collection of elements. As noted earlier, 
the optimal selection corresponds to simply picking the k largest weight elements in S. 

Our aim is to obtain the value $F(\mathfrak{G}, G, K, I_1 = 2, I_2 = 2, \dots, I_M = 2)$. All indicator variables are set to 2, as 
we do not care about any particular group being selected or rejected in the final selection. Notice that by definition, 

$$F(S, g, k, I_1 = i_1, \dots, I_j = 2, \dots, I_M = i_M) = \max\{F(S, g, k, I_1 = i_1, \dots, I_j = 0, \dots, I_M = i_M),\ F(S, g, k, I_1 = i_1, \dots, I_j = 1, \dots, I_M = i_M)\},$$

i.e., the optimal value when we do not care about a particular group being selected is simply the maximum over 
the two cases of the group being selected and being rejected, ceteris paribus. 

Note that the function F has an input space which is exponential in M (since the indicator variables combined 
can take exponentially many values). Therefore, if we tried to determine the values of F at all possible points, we 
would need an exponential amount of space and time. However, we shall see that our algorithm needs to work 
with only a small set of values of F, and hence runs in polynomial time. This happens because the values of the 
indicator variables will be important only for certain specific nodes, called boundary nodes. We define a boundary 
node as an explored node adjacent to an unexplored node. Hence the groups defined by the boundary node and 
the adjacent unexplored node overlap. 

Base Case. We start by taking $S = \emptyset$. For this case, all values of $F$ will be set to 0: 
$F(\emptyset, g, k, I_1 = i_1, I_2 = i_2, \dots, I_M = i_M) = 0$, $\forall g, k, i_1, \dots, i_M$. 

Assume that we have ordered the nodes from 1 to $M$ according to some criteria. We shall now explore the 
nodes in this order and use the following value-update rules. 

Value Update Rule. Suppose at a particular step, we have explored the first $m$ nodes. We assume that we have 
stored the values of $F$ for each $g$ and $k$. Further, we assume that we have stored the values of $F$ separately for each 
value of the indicator variable for each boundary node. Denote by $S_i$ the set of the first $i$ nodes. Denote by $B_m$ 
the set of boundary nodes when $m$ nodes have been explored. Denote by $X_m$ the $m$th group. We assume that the 
following values are available at this step: $F(S_m, g, k, I_1 = i_1, \dots, I_M = i_M)$ for all $1 \le g \le G$ and $1 \le k \le K$ 
and all $i_1, \dots, i_M$ such that $i_j \in \{0, 1\}$ for $j \in B_m$ and $i_j = 2$ for $j \in \{1, \dots, M\} \setminus B_m$. Note that the indicator 
variables for all non-boundary nodes, as well as the unexplored nodes, are set equal to 2. Thus the total number of 
values that have to be stored equals $G K 2^{|B_m|}$. 

The value update rule is divided into 3 cases and a final condensation step. 

1) The new node is rejected. All optimal values for all $k$ and all $g$ remain the same as for $m$ nodes. The added 
node is treated as a new boundary node and the stored values are associated to the new node being rejected. 

$$F(S_{m+1}, g, k, I_1 = i_1, \dots, I_m = i_m, I_{m+1} = 0, I_{m+2} = 2, \dots, I_M = 2) = F(S_m, g, k, I_1 = i_1, \dots, I_m = i_m, I_{m+1} = 2, \dots, I_M = 2)$$

for all $1 \le g \le G$ and $1 \le k \le K$ and all $i_1, \dots, i_m$ such that $i_j \in \{0, 1\}$ for $j \in B_m$ and $i_j = 2$ for 
$j \in \{1, \dots, m\} \setminus B_m$. 

2) The new node is accepted (no overlap with any explored node). Since the new node is selected, we can choose 
at most $g - 1$ explored nodes. We first compute the sum of the optimal value for choosing the best $\ell$ elements 
from the new node and the optimal value for choosing $k - \ell$ elements from the $g - 1$ explored nodes, for any 
$\ell$ such that $1 \le \ell \le k$. Then, the new optimal value for each $g$ and $k$ is given by taking the maximum of these 
sums over $\ell$. 

$$F(S_{m+1}, g, k, I_1 = i_1, \dots, I_m = i_m, I_{m+1} = 1, I_{m+2} = 2, \dots, I_M = 2) = \max_{1 \le \ell \le k} \{F(S_m, g - 1, k - \ell, I_1 = i_1, \dots, I_m = i_m, I_{m+1} = 0, \dots, I_M = 2) + H(X_{m+1}, \ell)\}$$

for all $1 \le g \le G$ and $1 \le k \le K$ and all $i_1, \dots, i_m$ such that $i_j \in \{0, 1\}$ for $j \in B_m$ and $i_j = 2$ for 
$j \in \{1, \dots, m\} \setminus B_m$. 

3) The new node is accepted (overlaps with some explored nodes). The update rule is the same as for case 2, 
but the elements in the region of overlap between the new node and the selected explored nodes must not be 
considered as being part of the new node. In other words, the new node must be 'cleaned' of the overlapping 
region before updating. For this step, we need to know exactly which nodes have been chosen while computing 
an optimal value. This is the reason why we need to store separate values for each boundary node. We further 
assume that the cleaning operation can be done in $O(1)$ time, leading to a total complexity of $O(G K^2 2^{|B_m|})$. 

$$F(S_{m+1}, g, k, I_1 = i_1, \dots, I_m = i_m, I_{m+1} = 1, I_{m+2} = 2, \dots, I_M = 2) = \max_{1 \le \ell \le k} \{F(S_m, g - 1, k - \ell, I_1 = i_1, \dots, I_m = i_m, I_{m+1} = 0, \dots, I_M = 2) + H(X'_{m+1}, \ell)\}$$

for all $1 \le g \le G$ and $1 \le k \le K$ and all $i_1, \dots, i_m$ such that $i_j \in \{0, 1\}$ for $j \in B_m$ and $i_j = 2$ for 
$j \in \{1, \dots, m\} \setminus B_m$. We also define $X'_{m+1} = X_{m+1} \setminus \bigcup_{j \in \mathcal{B}} X_j$, with $\mathcal{B} = \{j \in B_m : i_j = 1\}$. That is, we 
"clean" $X_{m+1}$ of the overlap with the currently selected boundary nodes. 

4) Condensation. After these steps, the number of stored values will be (at most) doubled. We can reduce them: 
for each boundary node which has fallen into the interior of the explored nodes, we combine the optimal 
values for it being picked or unpicked into a single value by taking the larger of the two values. Each such 
operation reduces the number of stored values by half and we perform it after each value update: 

$$F(S_{m+1}, g, k, I_1 = i_1, \dots, I_j = 2, \dots, I_M = i_M) = \max\{F(S_m, g, k, I_1 = i_1, \dots, I_j = 0, \dots, I_M = i_M),\ F(S_m, g, k, I_1 = i_1, \dots, I_j = 1, \dots, I_M = i_M)\}$$

for all $j \in (B_m \cup X_{m+1}) \setminus B_{m+1}$ and for all $1 \le g \le G$ and $1 \le k \le K$ and for all $i_1, \dots, i_M$. 

Time Complexity. Let $B$ be the maximum number of boundary nodes encountered by the algorithm, then 
the number of steps is bounded by $O(2^B K^2 G M)$. We now give an algorithm to explore the graph so that $B$ is 
logarithmic in $M$, establishing polynomial complexity. 

Fig. 8. Node Picking Rule: explore nodes in the order $T_1$, root, $T_2$, $T_3$ where $D_1 \ge D_2 \ge D_3$. For the subtree $T_1$, the node connected to 
root should be considered the root of $T_1$, which we denote by $R_1$; similarly for the other subtrees. 

Node Picking Rule. In order to minimize the number of boundary nodes encountered by the algorithm, we 
must explore the graph in a particular fashion. The order with which the nodes are picked is determined by a 
value associated to each subtree of the graph, which we call the D-value. In the following we describe how it is 
computed, how it depends logarithmically on the number of nodes in the graph and how the number of boundary 
nodes is bounded by the D-value. 

The Node Picking Rule is defined as follows. We first order all rooted subtrees with respect to 
the D-value, so that $D_1 \ge \dots \ge D_r$ for subtrees $T_1, T_2, \dots, T_r$. We then pick the subtrees in the order 
$\{T_1, \text{root}, T_2, \dots, T_r\}$ and recurse until the explored subtree has only one node, see Fig. 8. 

The procedure for computing the D-values is also recursive. If the tree has only one node, $D = 1$. Now, assume 
the subtrees at a node $Q$ have values $D_1 \ge \dots \ge D_r$. Then, $D(Q) = \max(D_1, D_2 + 1)$. In case there is no second 
subtree, $D_2 = 0$. We then have the following bound on the D-values. 

Lemma 4. The D-value of a rooted tree graph is logarithmic in the number of nodes, i.e., $D(\mathcal{G}) \le \log_2(M) + 1$. 

Proof: Let $D$ be a positive integer and $N(D)$ be the minimum number of nodes that a rooted tree must have 
in order to have a D-value of $D$. We prove by induction that 

$$N(D) \ge 2^{D-1}. \tag{14}$$

Base case: $D = 1$. A tree with only one node will have a D-value of 1. Hence (14) is satisfied. 

Inductive case: $D > 1$. Let $T$ be the smallest rooted tree graph whose D-value is equal to $D$. Spread out $T$ 
in the form of root and subtrees. Let the subtrees be $T_1, T_2, \dots, T_k$, with corresponding D-values $D_1, D_2, \dots, D_k$. 
Without loss of generality, assume that $D_1 \ge D_2 \ge \dots \ge D_k$. By definition, $D(T) = \max(D_1, D_2 + 1)$. 

By our assumption, $T$ is the smallest graph with D-value equal to $D$, hence we cannot have $D_1 = D(T) = D$, 
since that would give us a smaller rooted tree graph ($T_1$) with a D-value of $D$. This means that $D_1 < D$, and hence 
$D_2 + 1 = D$, i.e., $D_2 = D - 1$. Since $D_1 \ge D_2 = D - 1$ and $D_1 < D$, then $D_1 = D_2 = D - 1$. Thus, the graph 
$T$ has 2 subtrees ($T_1$ and $T_2$), with D-values of $D - 1$ each. By definition, any rooted subtree with a D-value of 
$D - 1$ must have at least $N(D - 1)$ nodes. By our induction hypothesis, $N(D - 1) \ge 2^{D-2}$. Therefore, $T$ has at 
least $2 \times 2^{D-2} = 2^{D-1}$ nodes. But since $T$ was the smallest rooted tree graph with D-value of $D$, this means that 
$N(D) \ge 2^{D-1}$, as required. Consequently, any tree with $M$ nodes and D-value $D$ satisfies $M \ge N(D) \ge 2^{D-1}$, 
i.e., $D \le \log_2(M) + 1$. ■ 

We now link the number of boundary nodes visited by the algorithm to the D-value of the group graph. 
Lemma 5. The total number of boundary nodes encountered by the node-picking algorithm cannot exceed the 
D-value of the graph. 

Proof: Let $T$ be the given rooted tree graph, with $M$ nodes. We shall consider the number of boundary nodes when 
there is a ghost node connected to the root node. This implies that the root, when explored, will always be counted 
as a boundary node. The ghost node captures the fact that, when we are running the algorithm recursively on a subtree, 
there will be an additional (potentially unexplored) node connected to the root of the subtree, which may lead to the 
root being counted as a boundary node. Let $B^*(T)$ denote the maximum number of boundary nodes encountered 
on $T$ when we pick nodes according to our algorithm, and let $B^*_G(T)$ represent the same when we also have the 
ghost node. Clearly, $B^*_G(T) \ge B^*(T)$, hence it is enough to prove the following: 

$$B^*_G(T) \le D(T). \tag{15}$$

We prove this by induction on M. 

Base Case. Suppose the rooted tree graph $T$ has only 1 node. Then the maximum number of boundary nodes 
encountered is obviously 1, which is equal to the D-value of the graph (by definition). Hence $B^*_G(T) \le D(T)$. 

Inductive Case. When the graph $T$ consists of $M$ nodes, $M > 1$, consider the graph to be spread out in the form 
of root and subtrees. Compute the D-values for each subtree, where w.l.o.g., $D_1 \ge D_2 \ge \dots \ge D_k$. Let $T_1, T_2, \dots, T_k$ 
be the corresponding subtrees. Our algorithm explores nodes in the sequence: $T_1$, root, $T_2$, $T_3$, $\dots$, $T_k$. 

Since each subtree has strictly fewer than $M$ nodes, each subtree satisfies (15) by the induction hypothesis. Also, 
notice that when exploring the subtree $T_1$ of $T$, the number of boundary nodes encountered is less than or equal 
to the number of boundary nodes encountered when exploring $T_1$ as a standalone rooted tree graph, with a ghost 
node connected to its root. By construction, this is exactly equal to $B^*_G(T_1)$, which by our induction hypothesis 
is bounded by $D_1$. Therefore, the number of boundary nodes encountered while exploring $T_1$ in $T$ cannot exceed 
$D_1$. Once we are finished with $T_1$, we pick the root, so the total number of boundary nodes is 1. We now proceed 
to pick $T_2$. By a similar argument, the maximum number of boundary nodes in $T_2$ at any point cannot exceed 
the number of boundary nodes encountered while exploring $T_2$ as a standalone graph with attached ghost node. In 
addition, the root of $T$ can contribute at most 1 additional boundary node (in fact, the ghost node for $T$ ensures 
that the root, once picked, will always contribute an additional boundary node). Therefore, the total number of 
boundary nodes in $T$ while exploring $T_2$ is at most $D_2 + 1$. Similar arguments hold for all other subtrees: the 
maximum number of boundary nodes while exploring the $k$-th subtree will be at most $D_k + 1$, which is at most 
$D_2 + 1$. 

Therefore, the maximum number of boundary nodes encountered at any step while exploring $T$ is $B^*_G(T) \le 
\max(D_1, D_2 + 1)$. By definition, $D(T) = \max(D_1, D_2 + 1)$. Therefore $B^*_G(T) \le D(T)$. ■ 

Combining Lemmas 4 and 5, we have the following result. 

Proposition A.1. The maximum number of boundary nodes at any step of the algorithm is logarithmic in the 
number of nodes, i.e., $B \le \log_2(M) + 1$. 

The previous proposition establishes the polynomial time complexity of the dynamic program for solving the 
generalized integer problem (11). Indeed, $B \le \log_2(M) + 1$ gives $2^B \le 2M$, so the bound $O(2^B K^2 G M)$ becomes $O(M^2 G K^2)$. 
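
The D-value recursion, the node picking order and the boundary-node count can be simulated on small trees. The following is a minimal sketch with assumed names (`d_value`, `picking_order`, `max_boundary`) and an assumed dictionary encoding of the rooted group-graph. It covers only the exploration order, not the value updates of the dynamic program, and it does not model the ghost node used in the proof of Lemma 5.

```python
def d_value(children, node):
    """D = 1 for a single node; otherwise max(D1, D2 + 1), where D1 >= D2 are
    the two largest D-values among the child subtrees (D2 = 0 if absent)."""
    vals = sorted((d_value(children, c) for c in children.get(node, [])),
                  reverse=True)
    if not vals:
        return 1
    d2 = vals[1] if len(vals) > 1 else 0
    return max(vals[0], d2 + 1)

def picking_order(children, node):
    """Explore T1, then the root, then T2, ..., Tr, applied recursively,
    with subtrees sorted by non-increasing D-value."""
    subs = sorted(children.get(node, []),
                  key=lambda c: d_value(children, c), reverse=True)
    if not subs:
        return [node]
    order = picking_order(children, subs[0]) + [node]
    for c in subs[1:]:
        order += picking_order(children, c)
    return order

def max_boundary(children, root):
    """Largest number of explored nodes adjacent to an unexplored node."""
    adj = {}
    for p, cs in children.items():
        for c in cs:
            adj.setdefault(p, set()).add(c)
            adj.setdefault(c, set()).add(p)
    explored, worst = set(), 0
    for v in picking_order(children, root):
        explored.add(v)
        count = sum(1 for u in explored if adj.get(u, set()) - explored)
        worst = max(worst, count)
    return worst

# Hypothetical group-graphs: a complete binary tree and a path.
binary = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
path = {0: [1], 1: [2]}
order = picking_order(binary, 0)
```

On the 7-node binary tree the D-value is 3, which respects the bound of Lemma 4, and the simulated exploration never holds more boundary nodes than the D-value.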

Theorem 1. The proposed dynamic program solves, in polynomial time, the problem of Weighted Maximum Cover 
with an additional constraint on element sparsity for loopless pairwise overlapping groups. In particular, its time 
complexity is $O(M^2 G K^2)$, where $M$ is the number of groups, $G$ is the group sparsity constraint and $K$ is the 
element sparsity budget. 



Appendix B 



Dynamic programming for solving the hierarchical signal approximation problem (12) 

Here, we give the proof of Prop. VII.2. We start by describing the dynamic program and then prove that its running 
time is polynomial. 

Fig. 9. Example of a nested subproblem in hierarchical groups model 



Proof: Problem (12) can be equivalently rephrased as the following optimization problem. Given a rooted tree 
$T$ with each node having at most $D$ children, a non-negative real number (weight) assigned to every node and a 
positive integer $K$, choose a subset of its nodes forming a rooted-connected subtree that maximizes the sum of 
weights of the chosen elements, such that the number of selected nodes does not exceed $K$. In our case (12), the 
weight of a node is the square of the value of the component of the signal associated to that node. The proposed 
algorithm leverages the optimal substructure of the problem. 

Nested Sub-problems. Suppose that a particular node $X$ belongs to the optimal $K$-node rooted-connected 
subtree. Consider the subtree $T_{X,d}$ obtained by choosing $X$, $d$ of its children ($1 \le d \le D$) and all descendants of 
these children. Consider the set of nodes $S$ consisting of all the nodes of $T_{X,d}$ which are also present in the optimal 
$K$-node rooted-connected subtree. Suppose there are $L$ nodes in $S$. Then the nodes in $S$ form the optimal $L$-node 
rooted-connected subtree at $X$, for the subgraph $T_{X,d}$. See Fig. 9 for an example. 

Dynamic Programming method. For every node $X$, we store the weight of the optimal $k$-node rooted-connected 
subtree at $X$, using only the nodes in the $d$ rightmost children of $X$ and their descendants, for each $k$ and $d$ such 
that $1 \le k \le K$ and $1 \le d \le D$. We define a function $F(X, k, d)$ to store these optimal values. We start from 
the leaf nodes and move upwards, for each node assessing all its subtrees from right to left, eventually covering 
the entire tree. At the end, the optimal value will be given by $F(\text{root}, K, D)$, that is, the value of the best $K$-node 
rooted connected subtree of the root considering all its descendants. 

Base Case. For every leaf node $X$ and for all $1 \le k \le K$ and $1 \le d \le D$, we set $F(X, k, d) = \mathrm{Weight}(X)$. 

Inductive Case. By induction, for every non-leaf node $X$, all the F-values are known for the descendants of 
$X$. Let $X_1, X_2, \dots, X_d$ be the $d$ children of $X$ in the right-to-left order, where $1 \le d \le D$. Then, we compute the 
F-values of $X$ using the following value update rules. 

Value Update Rules. 

1) For all $1 \le k \le K$ 

$$F(X, k, 1) = \mathrm{Weight}(X) + F(X_1, k - 1, D).$$

The optimal value for choosing a $k$-node subtree rooted at $X$, when only the rightmost child $X_1$ is allowed, 
equals the weight of $X$ itself (since $X$ must be chosen), plus the optimal value for choosing a rooted connected 
subtree with $k - 1$ nodes from the rightmost child $X_1$. 

2) For all $1 \le k \le K$ and $1 < i \le d$ 

$$F(X, k, i) = \max_{1 \le \ell \le k} \{F(X, \ell, i - 1) + F(X_i, k - \ell, D)\}.$$

For choosing the best $k$-node rooted connected subtree from the rightmost $i$ children, choose a positive integer 
$\ell \le k$, pick the best $\ell$-node subtree at $X$ by including the rightmost $i - 1$ children and pick the remaining 
$k - \ell$ nodes from the subtree of the $i$th child. We then take the maximum over all $\ell$, $1 \le \ell \le k$ (since at least 
1 node must be chosen from $X$ and its rightmost $i - 1$ children, this node will be the root $X$). 

3) For all $1 \le k \le K$ and $d < i \le D$ 

$$F(X, k, i) = F(X, k, d).$$

For convenience, when a node has only $d$ children, where $d$ is strictly less than $D$, we set F-values for cases 
involving more than $d$ children equal to the value for $d$ children. 
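
The value update rules can be sketched compactly by merging the children of each node one at a time. The code below is an illustrative simplification under stated assumptions, not the exact tabulation $F(X, k, d)$ described above: it keeps, for each node, only the table obtained after all its children have been merged, it treats $k$ as an upper bound on the number of nodes, and it uses the convention that choosing 0 nodes has value 0. The function name and the toy tree are hypothetical.

```python
def best_rooted_subtree(children, weight, root, K):
    """Best total weight of a connected subtree containing `root`
    with at most K nodes, for non-negative node weights."""
    def solve(x):
        # f[k]: best value of a subtree rooted at x with at most k nodes,
        # using the children merged so far; f[0] = 0 by convention.
        f = [0.0] + [weight[x]] * K              # x alone, no child merged yet
        for c in children.get(x, []):            # add one child at a time
            fc = solve(c)
            # l >= 1 nodes stay with x and the children merged earlier,
            # the remaining k - l nodes go to the subtree of child c.
            f = [0.0] + [max(f[l] + fc[k - l] for l in range(1, k + 1))
                         for k in range(1, K + 1)]
        return f
    return solve(root)[K]

# Toy tree (assumption): weights stand for squared signal coefficients.
tree = {0: [1, 2], 1: [3, 4], 2: [5]}
w = {0: 1.0, 1: 0.5, 2: 4.0, 3: 9.0, 4: 0.1, 5: 2.0}
best3 = best_rooted_subtree(tree, w, 0, 3)
best4 = best_rooted_subtree(tree, w, 0, 4)
```

For K = 3 the best choice in this toy tree is the chain of nodes 0, 1 and 3. Each merge costs $O(K^2)$ operations, which is in line with the polynomial bound stated in Theorem 2.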

Theorem 2. The time complexity of the dynamic program for hierarchical structures is polynomial in the number 
of nodes. 

Proof: Given the description of the algorithm above, we observe that we need to calculate at most $NDK$ 
F-values, and for calculating each value, we need to evaluate at most $K$ sums. Therefore the time required will be 
$O(NDK^2)$. ■ 